[RayJob][Feature] add light weight job submitter in kuberay image #2587
base: master
Just to note that although both PRs can solve the duplicate submission issue, this lightweight submitter can further shorten startup time by using a smaller image.
Makes sense, but I'm concerned about the KubeRay operator image becoming a dependency at the cluster/job level. If we think this is worth doing, we should probably create a new image.
Signed-off-by: Rueian <[email protected]>
Hi @kevin85421, I have used a new GitHub Actions job to build a dedicated image for the submitter, but the job requires credentials that I believe are only available after the PR is merged. Is there a suggested way to test the GitHub Actions job before merging? Or should we just merge it first?
IMO we don't need this now that #2579 is merged. At the least, we can revisit it after v1.3 based on user feedback.
The lightweight job submitter still has its own benefits (e.g., much faster image pulling), but I agree that we can revisit this based on the feedback from v1.3 to determine whether the image pulling overhead of the K8s Job Submitter is problematic. If users always run the submitter on a K8s node that caches the Ray image, the lightweight submitter may not be necessary.
Why are these changes needed?
Currently, as noted in issue #2537, when a user creates a `RayJob` CR, KubeRay uses the same image as the RayCluster to start another container that submits the Ray job. However, if that container runs on a node without the image preloaded, it takes a long time to download the image and start, since the image is usually large.
This PR adds a light submitter (45MB) that mimics the `ray job submit` behavior (submit + tail logs) to the KubeRay image, which is usually smaller than the image used in the RayCluster. Users can try it with the `submitterPodTemplate` in their RayJob CR.
Example RayJob CR yaml:
This submitter also does not fail when the job has already been submitted, so it also solves #2154.
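For reviewers who want to see the "do not fail if already submitted" behavior in isolation, here is a minimal Python sketch of the idea. It is not the Go code in this PR: `submit_if_absent`, `FakeDashboard`, and the two callables are hypothetical names that stand in for whatever calls the submitter makes to the Ray dashboard, and the sample ID and entrypoint are made up.

```python
from typing import Callable, Dict, Optional


def submit_if_absent(
    submission_id: str,
    entrypoint: str,
    get_status: Callable[[str], Optional[str]],
    submit: Callable[[str, str], None],
) -> bool:
    """Submit the job unless one with this ID already exists.

    Returns True if a new submission was made, False if it was skipped.
    """
    if get_status(submission_id) is not None:
        # Already submitted (e.g. the submitter pod was restarted): do not
        # fail, just return so the caller can go on to tail the logs.
        return False
    submit(submission_id, entrypoint)
    return True


class FakeDashboard:
    """In-memory stand-in for the Ray dashboard, for illustration only."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}
        self.submit_calls = 0

    def get_status(self, submission_id: str) -> Optional[str]:
        # None means "no job with this ID is known".
        return "RUNNING" if submission_id in self.jobs else None

    def submit(self, submission_id: str, entrypoint: str) -> None:
        self.submit_calls += 1
        self.jobs[submission_id] = entrypoint


dashboard = FakeDashboard()
first = submit_if_absent(
    "rayjob-sample", "python job.py", dashboard.get_status, dashboard.submit
)
retry = submit_if_absent(
    "rayjob-sample", "python job.py", dashboard.get_status, dashboard.submit
)
print(first, retry, dashboard.submit_calls)  # prints: True False 1
```

The check-then-submit order is what makes a restarted submitter harmless: the second run sees the existing job and skips straight to log tailing instead of erroring out on a duplicate ID.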
Related issue number
#2537
Checks